Method for supervised teaching of a recurrent artificial neural network

ABSTRACT

A method for the supervised teaching of a recurrent neural network (RNN) is disclosed. A typical embodiment of the method utilizes a large (50 units or more), randomly initialized RNN with a globally stable dynamics. During the training period, the output units of this RNN are teacher-forced to follow the desired output signal. During this period, activations from all hidden units are recorded. At the end of the teaching period, these recorded data are used as input for a method which computes new weights of those connections that feed into the output units. The method is distinguished from existing training methods for RNNs through the following characteristics: (1) Only the weights of connections to output units are changed by learning—existing methods for teaching recurrent networks adjust all network weights. (2) The internal dynamics of large networks are used as a “reservoir” of dynamical components which are not changed, but only newly combined by the learning procedure—existing methods use small networks, whose internal dynamics are themselves completely re-shaped through learning.

This application is the national phase under 35 U.S.C. § 371 of PCT International Application No. PCT/EP01/11490, which has an international filing date of Oct. 5, 2001, and which designated the United States of America.

TECHNICAL FIELD OF THE INVENTION

The present invention relates to the field of supervised teaching of recurrent neural networks.

BACKGROUND OF THE INVENTION

Artificial neural networks (ANNs) today provide many established methods for signal processing, control, prediction, and data modeling for complex nonlinear systems. The terminology for describing ANNs is fairly standardized. However, a brief review of the basic ideas and terminology is provided here.

A typical ANN consists of a finite number K of units, which at a discrete time t (where t=1,2,3 . . . ) have an activation x_(i)(t) (i=1, . . . , K). The units are mutually linked by connections with weights w_(ji) (where i, j=1, . . . , K and where w_(ji) is the weight of the connection from the i-th to the j-th unit), which typically are assigned real numbers. A weight w_(ji)=0 indicates that there is no connection from the i-th to the j-th unit. It is convenient to collect the connection weights in a connection matrix W=(w_(ji))_(j,i=1, . . . , K). The activation of the j-th unit at time t+1 is derived from the activations of all network units at time t by

$$x_{j}(t+1) = f_{j}\left( \sum_{i=1,\ldots,K} w_{ji}\, x_{i}(t) \right) \qquad (1)$$

where the transfer function ƒ_(j) typically is a sigmoid-shaped function (linear or step functions are also relatively common). In most applications, all units have identical transfer functions. Sometimes it is beneficial to add noise to the activations. Then (1) becomes

$$x_{j}(t+1) = f_{j}\left( \sum_{i=1,\ldots,K} w_{ji}\, x_{i}(t) \right) + v(t) \qquad (1')$$

where v(t) is an additive noise term.

Some units are designated as output units; their activation is considered as the output of the ANN. Some other units may be assigned as input units; their activation x_(i)(t) is not computed according to (1) but is set to an externally given input u_(i)(t), i.e.

$$x_{i}(t) = u_{i}(t) \qquad (2)$$

in the case of input units.
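
For illustration only, the update rules (1) and (2) can be rendered in a few lines of Python (NumPy). The network size, the random weights, and the tanh transfer function are arbitrary placeholder choices, not prescriptions of the invention.

```python
import numpy as np

K = 10                                   # number of units (arbitrary)
rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, size=(K, K))  # connection matrix W = (w_ji)
x = rng.uniform(-1.0, 1.0, size=K)       # activations x_i(t)

input_units = [0, 1]                     # indices of designated input units

def step(x, u):
    """One synchronous update by Eq. (1); input units are clamped by Eq. (2)."""
    x_next = np.tanh(W @ x)              # f_j = tanh for all units
    x_next[input_units] = u              # x_i(t) = u_i(t) for input units
    return x_next

x = step(x, u=np.array([0.3, -0.1]))
```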

Most practical applications of ANNs use feedforward networks, in which activation patterns are propagated from an input layer through hidden layers to an output layer. The characteristic feature of feedforward networks is that there are no connection cycles. In formal theory, feedforward networks represent input-output functions. A typical way to construct a feedforward network for a given functionality is to teach it from a training sample, i.e. to present it with a number of correct input-output pairings, from which the network learns to approximately repeat the training sample and to generalize to other inputs not present in the training sample. Using a correct training sample is called supervised learning. The most widely used supervised teaching method for feedforward networks is the backpropagation algorithm, which incrementally reduces the quadratic output error on the training sample by a gradient descent on the network weights. The field had its breakthrough when efficient methods for computing the gradient became available, and is now an established and mature subdiscipline of pattern classification, control engineering and signal processing.

A particular variant of feedforward networks, radial basis function networks (RBF networks), can be used with a supervised learning method that is simpler and faster than backpropagation. (An introduction to RBF networks is given in the article “Radial basis function networks” by D. Lowe, in: Handbook of Brain Theory and Neural Networks, M. A. Arbib (ed.), MIT Press 1995, p. 779-782.) Typical RBF networks have a hidden layer whose activations are computed quite differently from (1). Namely, the activation of the j-th hidden unit is a function

$$g_{j}(\lVert u - v_{j} \rVert) \qquad (3)$$

of the distance of the input vector u from some reference vector v_(j). The activation of output units follows the prescription (1), usually with a linear transfer function. In the teaching process, the activation mechanism for hidden units is not changed. Only the weights of hidden-to-output connections have to be changed in learning. This renders the learning task much simpler than in the case of backpropagation: the weights can be determined off-line (after presentation of the training sample) using linear regression methods, or can be adapted on-line using any variant of mean square error minimization, for instance variants of the least-mean-square (LMS) method.

If one admits cyclic paths of connections, one obtains recurrent neural networks (RNNs). The hallmark of RNNs is that they can support self-exciting activation over time, and can process temporal input with memory influences. From a formal perspective, RNNs realize nonlinear dynamical systems (as opposed to feedforward networks, which realize functions). From an engineering perspective, RNNs are systems with a memory. It would be a significant benefit for engineering applications to construct RNNs that perform a desired input-output dynamics. However, such applications of RNNs are still rare. The major reason for this rareness lies in the difficulty of teaching RNNs. The state of the art in supervised RNN learning is marked by a number of variants of the backpropagation through time (BPTT) method. A recent overview is provided by A. F. Atiya and A. G. Parlos in the article “New Results on Recurrent Network Training: Unifying the Algorithms and Accelerating Convergence”, IEEE Transactions on Neural Networks, vol. 11, no. 3 (2000), 697-709. The intuition behind BPTT is to unfold the recurrent network in time into a cascade of identical copies of itself, where recurrent connections are re-arranged such that they lead from one copy of the network to the next (instead of back into the same network). This “unfolded” network is, technically, a feedforward network and can be taught by suitable variants of teaching methods for feedforward networks. This way of teaching RNNs inherits the iterative, gradient-descent nature of standard backpropagation, and multiplies its intrinsic cost by the number of copies used in the “unfolding” scheme. Convergence is difficult to steer and often slow, and the single iteration steps are costly. By force of computational costs, only relatively small networks can be trained. Another difficulty is that the back-propagated gradient estimates quickly degrade in accuracy (going to zero or infinity), thereby precluding the learning of memory effects over timespans greater than approx. 10 timesteps. These and other difficulties have so far prevented RNNs from being widely used.

SUMMARY OF THE INVENTION

The present invention presents a novel method for the supervised teaching of RNNs. The background intuitions behind this method are quite different from existing BPTT approaches. The latter try to meet the learning objective by adjusting every weight within the network, thereby attaining a minimal-size network in which every unit contributes maximally to the desired overall behavior. This leads to a small network that performs a particular task. By contrast, the method disclosed in the present invention utilizes a large recurrent network, whose internal weights (i.e. on hidden-to-hidden, input-to-hidden, or output-to-hidden connections) are not changed at all. Intuitively, the large, unchanged network is used as a rich “dynamical reservoir” of as many different nonlinear dynamics as there are hidden units. Another perspective on this reservoir network is to view it as an overcomplete basis. Only the hidden-to-output connection weights are adjusted in the teaching process. By this adjustment, the hidden-to-output connections acquire the functionality of a filter which distills and re-combines from the “reservoir” dynamical patterns in a way that realizes the desired learning objective.

A single instantiation of the “reservoir” network can be re-used for many tasks, by adding new output units and separately teaching their respective hidden-to-output weights for each task. After learning, arbitrarily many such tasks can be carried out in parallel, using the same single instantiation of the large “reservoir” network. Thereby, the overall cost of using an RNN set up and trained according to the present invention is greatly reduced in cases where many different tasks have to be carried out on the same input data. This occurs e.g. when a signal has to be processed by several different filters.

The temporal memory length of RNNs trained with the method of the invention is superior to that of existing methods. For instance, “short term memories” of about 100 time steps are easily achievable with networks of 400 units. Examples of this are described later in this document (Section on Examples).

The invention has two aspects: (a) architectural (structure of the RNN, its setup and initialization), and (b) procedural (teaching method). Both aspects are interdependent.

Dynamical Reservoir (DR)

According to one architectural aspect of the invention, there is provided a recurrent neural network whose weights are fixed and are not changed by subsequent learning. The function of this RNN is to serve as a “reservoir” of many different dynamical features, each of these being realized in the dynamics of the units of the network. Henceforward, this RNN will be called the dynamical reservoir, and abbreviated by DR.

Preferably, the DR is large, i.e. has on the order of 50 or more units (no upper limit).

Preferably, the DR's spontaneous dynamics (with zero input) is globally stable, i.e. the DR converges to a unique stable state from every starting state.

In applications where the processed data has a spatial structuring (e.g., video images), the connectivity topology of the DR may also carry a spatial structure.

Input Presentation

According to another architectural aspect of the invention, n-dimensional input u(t) at time t (t=1,2,3 . . . ) is presented to the DR by any means such that the DR is induced by the input to exhibit a rich excited dynamics.

The particular way in which input is administered is of no concern for the method of the invention. Some possibilities which are traditionally used in the RNN field are now briefly mentioned.

Preferably, the input is fed into the DR by means of extra input units. The activations of such input units are set to the input u(t) according to Eq. (2). In cases where the input has a spatiotemporal character (e.g., video image sequences), the input units may be arranged in a particular spatial fashion (“input retina”) and connected to the DR in a topology-preserving way. Details of how the weights of the input-to-DR connections are determined are given in the “detailed description of preferred embodiments” section.

Alternatively, input values can be fed directly as additive components to the activations of the units of the DR, with or without spatial structuring.

Alternatively, the input values can be coded before they are presented to the DR. For instance, spatial coding of numerical values can be employed.

Reading Out Output

According to another architectural aspect of the invention, m-dimensional output y(t) at time t is obtained from the DR by reading it out from the activations of m output units (where m≧1). By convention, the activations of the output units shall be denoted by y₁(t), . . . , y_(m)(t).

In a preferred embodiment of the invention, these output units are attached to the DR as extra units. In this case (i.e., extra output units), there may also be provided output-to-DR connections which feed back output unit activations into the DR network. Typically, no such feedback will be provided when the network is used as a passive device for signal processing (e.g., for pattern classification or for filtering). Typically, feedback connections will be provided when the network is used as an active signal generation device. Details of how to determine feedback weights are described in the “detailed description of preferred embodiments” section.

According to another architectural aspect of the invention, the activation update method for the m outputs y₁(t), . . . , y_(m)(t) is of the form given in equation (1), with transfer functions ƒ₁, . . . , ƒ_(m). The transfer functions ƒ_(j) of output units typically will be chosen as sigmoids or as linear functions.

FIG. 1 provides an overview of a preferred embodiment of the invention, with extra input and output units. In this figure, the DR [1] is receiving input by means of extra input units [2] which feed input into the DR through input-to-DR connections [4]. Output is read out of the network by means of extra output units [3], which in the example of FIG. 1 also have output-to-DR feedback connections [7]. Input-to-DR connections [4] and output-to-DR feedback connections [7] are fixed and not changed by training. Finally, there are DR-to-output connections [5] and [possibly, but not necessarily] input-to-output connections [6]. The weights of these connections [5], [6] are adjusted during training.

Next, the procedural aspects of the invention (teaching method) are related. As with all supervised teaching methods for RNNs, it is assumed that a training sequence is given. The training sequence consists of two time series u(t) and {tilde over (y)}(t), where t=1,2, . . . , N. It is tacitly understood that in cases of online learning, N need not be determined at the outset of the learning; the learning procedure is then an open-ended adaptation process. u(t) is an n-dimensional input vector (where n≧0, i.e., the no-input case n=0 is also possible), and {tilde over (y)}(t) is an m-dimensional output vector (with m≧1). The two time series u(t) and {tilde over (y)}(t) represent the desired, to-be-learnt input-output behavior. As a special case, the input sequence u(t) may be absent; the learning task is then to learn a purely generative dynamics.

The training sequences u(t), {tilde over (y)}(t) are presented to the network for t=1,2, . . . , N. At every time step, the DR is updated (according to the chosen update law, e.g., Equation (1)), and the activations of the output units are set to the teacher signal {tilde over (y)}(t) (teacher forcing).
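
A minimal sketch of this teacher-forced collection phase in Python (NumPy) follows; it assumes tanh DR units, extra input units with a weight matrix W_in, and output-to-DR feedback weights W_back (all hypothetical names), and for simplicity pairs only the DR state x(t) with the target ỹ(t+1), whereas the full inner product of Eq. (5) below would also include input and output unit activations.

```python
import numpy as np

def collect_states(W, W_in, W_back, u_seq, y_teach, washout=100):
    """Run the network with teacher forcing and record training pairs.

    W: K x K fixed DR weights; W_in: K x n input weights; W_back: K x m
    feedback weights; u_seq: N x n inputs; y_teach: N x m teacher outputs.
    Returns pairs (x(t), y~(t+1)), discarding the first `washout` steps."""
    x = np.zeros(W.shape[0])
    states, targets = [], []
    for t in range(len(y_teach) - 1):
        if t >= washout:
            states.append(x.copy())           # x(t)
            targets.append(y_teach[t + 1])    # desired output y~(t+1)
        # DR update from current state, current input, and forced output
        x = np.tanh(W @ x + W_in @ u_seq[t] + W_back @ y_teach[t])
    return np.array(states), np.array(targets)
```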

The method of the invention can accommodate both off-line learning and on-line learning.

In off-line learning, both the activation vector x(t) of non-output units and the teacher signal {tilde over (y)}(t) are collected for t=1,2, . . . , N. From these data, at time N weights w_(ji) are calculated for connections leading into the output units, such that the mean square error

$$E\left[\varepsilon_{j}^{2}\right] = \frac{1}{N-1} \sum_{t=1}^{N-1} \left( f_{j}^{-1}\left( \tilde{y}_{j}(t+1) \right) - \left\langle w_{j}, x(t) \right\rangle \right)^{2} \qquad (4)$$

is minimized for every output unit j=1, . . . , m over the training sequence data. In equation (4), <w_(j),x(t)> denotes the inner product

$$w_{j1}u_{1}(t) + \ldots + w_{jn}u_{n}(t) + w_{j,n+1}x_{1}(t) + \ldots + w_{j,n+K}x_{K}(t) + w_{j,n+K+1}y_{1}(t) + \ldots + w_{j,n+K+m}y_{m}(t), \qquad (5)$$

this form of <w_(j),x(t)> being given if there are extra input units. The calculation of weights which minimize Eq. (4) is a standard problem of linear regression, and can be done with any of the well-known solution methods for this problem. Details are given in the Section “Detailed description of preferred embodiments”.
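
As a sketch of the off-line computation, assuming tanh output units (so ƒ⁻¹ = arctanh, requiring teacher values in (−1, 1); for linear output units the identity is used instead), the regression can be done with any standard least-squares solver:

```python
import numpy as np

def train_output_weights(states, targets, f_inv=np.arctanh):
    """Minimize Eq. (4) by ordinary linear regression.

    states:  matrix whose rows are the recorded vectors x(t)
    targets: matrix whose rows are the desired outputs y~(t+1)
    Returns the weight matrix whose j-th row is w_j."""
    W_out, *_ = np.linalg.lstsq(states, f_inv(targets), rcond=None)
    return W_out.T
```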

The weights w_(ji) are the final result of the procedural part of the method of the invention. After setting these weights in the connections that feed into the output units, the network can be exploited.

In online-learning variants of the invention, the weights w_(j) are incrementally adapted. More precisely, for j=1, . . . , m, the weights w_(j)(t) are updated at every time t₀=1,2, . . . , N by a suitable application of any of the many well-known methods that adaptively and incrementally minimize the mean square error up to time t₀,

$$E\left[\varepsilon_{j}^{2}(t_{0})\right] = \frac{1}{t_{0}-1} \sum_{t=1}^{t_{0}-1} \left( f_{j}^{-1}\left( \tilde{y}_{j}(t+1) \right) - \left\langle w_{j}, x(t) \right\rangle \right)^{2}. \qquad (4a)$$

Adaptive methods that minimize this kind of error are known collectively under the name of “recursive least squares” (RLS) methods. Alternatively, from a statistical perspective one can also minimize the statistically expected square error

$$E\left[\varepsilon_{j}^{2}\right] = E\left[ \left( f_{j}^{-1}\left( \tilde{y}_{j}(t+1) \right) - \left\langle w_{j}, x(t) \right\rangle \right)^{2} \right], \qquad (4b)$$

where on the right-hand side E denotes statistical expectation. Adaptive methods that minimize (4b) are stochastic gradient descent methods, of which there are many, among them Newton's method and the most popular of all MSE minimization methods, the LMS method. However, the LMS method is not ideally suited to be used with the method of the invention. Details are given in the Section “Detailed description of preferred embodiments”.
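
A sketch of one common RLS formulation for a single output unit is given below; the forgetting factor lam and the initialization constant delta are standard RLS parameters, and the variable names are illustrative rather than taken from any particular reference.

```python
import numpy as np

class RLSFilter:
    """Exponentially weighted recursive least squares, minimizing Eq. (4a)."""

    def __init__(self, dim, lam=0.9995, delta=1.0):
        self.w = np.zeros(dim)          # weight vector w_j
        self.P = np.eye(dim) / delta    # inverse correlation matrix estimate
        self.lam = lam                  # forgetting factor

    def update(self, x, d):
        """One step: x is the state vector x(t), d = f^{-1}(y~_j(t+1))."""
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)    # gain vector
        err = d - self.w @ x            # a priori error
        self.w += k * err
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return err
```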

BRIEF DESCRIPTION OF THE FIGURES

The provided Figures are, with the exception of FIG. 1, illustrations of the examples described below. They are referenced in detail in the description of the examples. Here is an overview of the Figures.

FIG. 1 is a simplified overview of a preferred embodiment of the invention.

FIG. 2 shows various data sets obtained from the first example, a simplistic application of the method of the invention to obtain a sine generator network, which is reported for didactic reasons.

FIG. 3 shows various data sets obtained from the second example, an application of the method of the invention to obtain a short time memory network in the form of a delay line.

FIG. 4 shows the connectivity setup and various data sets obtained from the third example, an application of the method of the invention to obtain a model of an excitable medium trained from a single “soliton” teacher signal.

FIG. 5 shows various data sets obtained from the fourth example, an application of the method of the invention to learn a chaotic time series generator.

FIG. 6 illustrates the fifth example, by providing a schematic setup of a network applied to learning a state feedback tracking controller for a pendulum, and various data sets obtained in this example.

FIG. 7 shows various data sets obtained from the sixth example, an application of the method of the invention to learn a bidirectional device which can be used as a frequency meter or a frequency generator.

DESCRIPTION OF SOME EXAMPLES

Before the invention is described in detail in subsequent sections, it will be helpful to demonstrate the invention with some exemplary embodiments. The examples are selected to highlight different basic aspects of the invention.

Example 1 A Toy Example to Illustrate Some Basic Aspects of the Invention

This example demonstrates the basic aspects of the invention with a toy example. The task is to teach an RNN to generate a sine wave signal. Since this task is almost trivial, the size of the DR was selected to be only 20 units (for more interesting tasks, network sizes should be significantly greater).

First, it is shown how the network architecture was set up. The 20 units were randomly connected with a connectivity of 20%, i.e., on average every unit had connections with 4 other units (including possible self-connections). The connection weights were set randomly to either 0.5 or −0.5.
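
This random setup can be sketched as follows; the seed and the NumPy Generator API are incidental implementation choices.

```python
import numpy as np

rng = np.random.default_rng(42)
K, connectivity = 20, 0.20

# Roughly 20% of all K*K possible connections are nonzero;
# each nonzero weight is +0.5 or -0.5 with equal probability.
mask = rng.random((K, K)) < connectivity
signs = rng.choice([0.5, -0.5], size=(K, K))
W = np.where(mask, signs, 0.0)
```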

This network was left running freely. FIG. 2a shows a trace of 8 arbitrarily selected units in the asymptotic activity. It is apparent that all DR units are entrained to a low-amplitude oscillation.

According to the architectural aspects of the invention, an autonomous self-excitation of the DR is not desired. The DR's autonomous dynamics should be globally stable, i.e., converge to a stable all-zero state from any initial starting state. Therefore, the weights were decreased by a factor of 0.98, i.e., a weight that was previously 0.5 became 0.49. FIG. 2b shows a 200-step trace obtained after 200 initial steps after starting the network in a random initial state. It is apparent that with the new weights the network's dynamics is globally stable, i.e. will asymptotically decay to all-zero activations.

This global stability is only marginal in the sense that a slight increase of weights would render the dynamics unstable (in this case, oscillation would set in by an increase of absolute weight values from 0.49 to 0.5). A marginal global stability in this sense is often the desired condition for the setup of the DR according to the invention.

Next, the response characteristics of the DR were probed. To this end, an extra input unit was attached. It was completely connected to the DR, i.e., a connection was established from the input unit to each of the 20 units of the DR. The connection weights were set to values randomly taken from the interval [−2, 2]. FIG. 2c shows the response of the network to a unit impulse signal given at time t=10. The first seven plots in FIG. 2c show activation traces of arbitrarily selected DR units. The last plot shows the input signal. It becomes apparent that the DR units show a rich variety of response dynamics. This is the desired condition for the setup of DRs according to the invention.

Next, the response of the DR network to a sine input was probed. Analogous to FIG. 2c, FIG. 2d shows the asymptotic response of seven DR units and the input signal. This Figure again emphasizes the rich variety of responses of DR units. Finally, the network was trained to generate the same sine signal that was administered previously as input. The extra unit that was previously used as input unit was left unchanged in its connections to the DR, but now was used as an output unit. Starting from an all-zero activation, the network was first run for 100 steps with teacher forcing to settle initial transients. Then, it was run another 500 steps with teacher forcing. The activation values of the 20 DR units were recorded for these 500 steps. At time t=600, an offline learning of weights from the DR to the output unit was performed, i.e., the DR-to-output weights were computed as the solutions of a linear regression of the desired output values on the DR states, minimizing the mean square error of Equation (4). Thereafter, teacher forcing was switched off, and the network was left to run freely for another 10,000 steps. After that, 50 steps were plotted to obtain FIG. 2e. Here, the eighth plot shows the activation of the output unit. Unsurprisingly, FIG. 2e is virtually the same as FIG. 2d. FIG. 2f shows a superposition of the output with the teacher signal (which is unknown to the network): teacher signal = solid line, network output = dashed line. The dashed line is identical to the solid line at the plotting resolution; in fact, the numerical value of the mean square error (4) was 1.03×10⁻¹³ for this (simple) learning task.

Example 2 A Short Time Memory

In this example it is shown how the method of the invention can be used to teach an RNN to produce delayed versions of the input.

The network was set up as in FIG. 1. The DR had a size of 100 units. It was randomly connected with a connectivity of 5%. Nonzero weights were set to +0.45 or −0.45 with equal probability. This resulted in a globally stable dynamics of the DR (again, of marginal stability: increasing absolute values of weights to 0.475 would destroy global stability). The impulse responses of the DR's units to a unit impulse were qualitatively similar to the ones in example 1 (cf. FIG. 2c) and are not shown.

One input unit was attached to the DR, by connecting the input unit to every unit of the DR. Weights of these connections were randomly set to 0.001 or −0.001 with equal probability.

Furthermore, three extra output units were provided, with no output-to-DR feedback connections.

The learning task consisted in repeating in the output nodes the input signal with delays of 10, 20, 40 time steps. The input signal used was essentially a random walk with a banded, nonstationary frequency spectrum. FIG. 3a shows a 50-step sequence of the input (solid line) and the correct delayed signal (teacher signal) of delay 10 (dashed line).

The network state was randomly initialized. The input was then presented to the network for 700 update steps. Data from the first 200 update steps were discarded to get rid of initial transient effects. Data from the remaining 500 update steps were collected and used with the off-line embodiment of the learning method of the invention. The result was weights for the connections from DR and input units to output units. The network run was continued with the learnt weights for another 150 update steps. The input and outputs of the last 50 update steps are plotted in FIG. 3b. The three plots show the correct delayed signal (solid) superimposed on the outputs generated by the learnt network (dashed). It becomes apparent that the network has successfully learnt to delay a signal even for as long as 40 time steps.

In order to quantify the precision of the learnt network output, the mean square error of each of the three output units was calculated from a sample sequence. They were found to be 0.0012, 0.0013, 0.0027 for the delays of 10, 20, 40, respectively.

Comment. The challenge of this learning task is that the network has to serve as a temporal memory. This goal is served by two aspects of the setup of the network for learning. First, the autonomous dynamics of the DR was tuned such that it was globally stable only by a small margin. The effect is that dynamic aftereffects of input die out slowly, which enhances the temporal memory depth. Second, the input-to-DR connections had very small weights. The effect was that the ongoing (memory-serving) activation within the DR net is only weakly modulated, such that memory-relevant “repercussions” are not too greatly disturbed by incoming input.

Example 3 Learning an Excitable Medium

In this example it is demonstrated how the method of the invention can be used to train a 2-dimensional network to support the dynamics of an excitable medium.

The network was set up as in FIGS. 4a,b. It consisted of two layers of 100 units, which were each arranged in a 10×10 grid. To avoid dealing with boundary conditions, the grid was topologically closed into a torus. The first layer was used as the DR, the second layer was the output layer.

A local connectivity pattern was provided, as follows. Each unit of the first layer received connections from locally surrounding units within that layer (FIG. 4a). The weights were set depending on the distance r1 between units, as shown in FIG. 4c. The resulting internal DR dynamics is depicted in FIG. 4d, which shows the response of 8 arbitrarily selected units of the first layer to a unit impulse fed into the first unit at timestep 10. It can be seen that the DR dynamics dies out, i.e., it is globally stable.

Each unit of the first layer additionally received connections from output units that lay in a local neighborhood of radius r2. The dependency of weights on the distance r2 is shown in FIG. 4e.

Among all possible connections from the DR to a particular output unit, only the ones within a grid distance r3 of less than or equal to 4 (FIG. 4b) had to be trained. The goal of learning was to obtain weights for these DR-to-output connections.

No input was involved in this learning task.

The teaching signal consisted in a “soliton” wave which was teacher-forced on the output layer. The soliton slowly wandered with constant speed and direction across the torus. FIG. 4f shows four successive time steps of the teacher signal. Note the effects of the torus topology in the first snapshot.

The teaching proceeded as follows. The DR network state was initialized to all zeros. The network was then run for 60 time steps. The DR units were updated according to Equation (1), with a sigmoid transfer function ƒ=tanh. The output units were updated by teacher forcing, i.e., the teacher signal shown in FIG. 4f was written into the output units. Data from the first 30 time steps were discarded, and the data collected from the remaining 30 time steps were used for the off-line embodiment of the learning method of the invention. The result was weights for the connections from DR units to output units. A speciality of this learning task is that the result of the teaching should be spatially homogeneous, i.e., all output units should be equipped with the same set of weights. This allowed the data obtained from all 100 output units to be pooled for the learning method of the invention, i.e. a training sample of effectively 100×30=3000 pairings of network states and desired outputs was used to calculate the desired weight set.

To get an impression of what the network has learnt, several demonstration runs were performed with the trained network.

In the first demonstration, the network was teacher-forced with the soliton teacher for an initial period of 10 time steps. Then the teacher forcing was switched off and the network was left running freely for 100 further steps. FIG. 4g shows snapshots taken at time steps 1, 5, 10, 20, 50, 100 from this free run. The initially forced soliton persists for some time, but then the overall dynamics reorganizes into a stable, symmetric pattern of two larger solitons that wander across the torus with the same speed and direction as the training soliton.

In other demonstrations, the network was run from randomized initial states without initial teacher forcing. After some time (typically less than 50 time steps), globally organized, stable patterns of travelling waves emerged. FIG. 4h shows a smooth and a rippled wave pattern that emerged in this way.

Comment. This example highlights how the method of the invention applies to spatial dynamics. The learning task actually is restricted to a single output unit; the learnt weights are copied to all other output units due to the spatial homogeneity condition that was imposed on the system in this example. The role of the DR is taken by the hidden layer, whose weights in this case were not given randomly (as in the previous examples) but were designed according to FIG. 4c.

Example 4 Learning a Chaotic Oscillator: the Lorenz Attractor

In this example it is shown how the method of the invention can be used for the online learning of a chaotic oscillator, in the presence of noise in the teaching signal.

The network was set up with a randomly and sparsely connected DR (80 units, connectivity 0.1, weights +0.4 or −0.4 with equal probability) and a single output unit (output-to-DR feedback connections with full connectivity, random weights drawn from a uniform distribution over [−2, 2]). The update rule was a “leaky integration” variant of Eq. (1), which uses a “potential” variable v to mix earlier states with the current state:

$$x_{j}(t+1) = f\left( v_{j}(t+1) \right), \qquad v_{j}(t+1) = \left( 1 - a_{j} \right) \left( \sum_{i=1,\ldots,K} w_{ji}\, x_{i}(t) \right) + a_{j}\, v_{j}(t) \qquad (6)$$

A transfer function ƒ=tanh was used. The leaking coefficients a_(j) were chosen randomly from a uniform distribution over [0, 0.2].
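
A sketch of one leaky-integration update step per Eq. (6), under the stated choices ƒ = tanh and a_j drawn uniformly from [0, 0.2]:

```python
import numpy as np

def leaky_step(W, x, v, a):
    """One update by Eq. (6): the potential v mixes the new net input
    with the previous potential via per-unit leaking coefficients a_j."""
    v_next = (1.0 - a) * (W @ x) + a * v
    return np.tanh(v_next), v_next

K = 80
rng = np.random.default_rng(1)
a = rng.uniform(0.0, 0.2, size=K)   # leaking coefficients
x, v = np.zeros(K), np.zeros(K)     # initial activations and potentials
```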

As in the previous examples, this setup resulted in an RNN with marginal global stability and a rich variety in the impulse responses of the individual units.

The 1-dimensional teaching signal was obtained by projecting the well-known 3-dimensional Lorenz attractor on its first dimension. A small amount of noise was added to the signal. A delay-embedding representation of the noisy teacher signal is shown in FIG. 5a, and of the teacher signal without noise in FIG. 5b. The learning task was to adapt the DR-to-output weights (using the noisy training signal) such that the neural network reproduced in its output unit dynamics the (noise-free) Lorenz trace.

The output weights were trained according to the method of the invention. For demonstration purposes, three variants are reported here: (a) offline learning, (b) online learning with the RLS method, (c) online learning with the LMS method.

Offline learning. The network state was initialized to all zeros. The network was then run for 5100 update steps with teacher forcing, i.e., the correct teacher output was written into the output unit. Data from the first 100 update steps were discarded. Data from the remaining 5000 update steps with teacher forcing were collected and used to determine DR-to-output weights with minimal MSE (Eq. (4)) by a linear regression computation. The MSE (4) incurred was 0.000089 (the theoretically possible minimum mean square error, stemming from the noise component in the signal, would be 0.000052). A time series generated by the trained network is shown in FIG. 5c.

Online learning with the RLS method. The “recursive least squares” method can be implemented in many variants. Here, the version from the textbook B. Farhang-Boroujeny, Adaptive Filters: Theory and Applications, Wiley & Sons 1999, p. 423, was used. The same DR was used as in the offline learning version. The “forgetting rate” required by RLS was set to λ=0.9995. FIG. 5d shows the learning curve (development of log₁₀(ε²), low-pass filtered by averaging over 100 steps per plot point). The error converges to a final misadjustment level of approximately 0.000095 after about 1000 steps, which is slightly worse than in the offline trial. FIG. 5e shows a time series generated by the trained network.

Online learning with the LMS method. The least mean squares method is very popular due to its robustness and simplicity. However, as was already mentioned in the “Summary of the Invention”, it is not ideal in connection with the method of the invention. The reason is that DR state vectors have large Eigenvalue spreads. Nevertheless, for illustration of this fact, the LMS method was carried out. The LMS method updates weights at every time step according to

$$w_{ji}(t+1) = w_{ji}(t) + \mu\, \varepsilon\, x_{i}(t), \qquad (7)$$

where μ is a learning rate, j is the index of the output unit, and ε=ƒ⁻¹({tilde over (y)}_(j)(t))−ƒ⁻¹(y_(j)(t)) is the output unit state error, i.e. the difference between the (ƒ-inverted) teacher signal {tilde over (y)}_(j)(t) and the (ƒ-inverted) output unit signal y_(j)(t).
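
The per-step update of Eq. (7) is a one-liner; the sketch below assumes the f-inverted teacher and output values are supplied by the caller.

```python
import numpy as np

def lms_update(w, x, d, y, mu):
    """One LMS step per Eq. (7). w: weights of output unit j; x: state
    vector x(t); d, y: f-inverted teacher and output signals; mu: rate."""
    eps = d - y                          # output unit state error
    return w + mu * eps * np.asarray(x), eps
```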

The network was adapted in five successive epochs with decreasing learning rates μ: 1. μ=0.03, N=1,000 steps; 2. μ=0.01, N=10,000; 3. μ=0.003, N=50,000; 4. μ=0.001, N=100,000; 5. μ=0.0003, N=200,000. At the end of the fifth epoch, a mean square error E[ε²]≈0.000125 was reached. FIG. 5f shows the learning curve (all epochs joined), and FIG. 5g shows a time series generated by the trained network. It is apparent that the trained network produces a point attractor instead of a chaotic attractor. This highlights the fact that the LMS method is ill-suited for training DR-to-output weights. A closer inspection of the Eigenvalue distribution of the covariance matrix of state vectors x(t) of the trained network reveals that the Eigenvalue spread is very high indeed: λ_(max)/λ_(min)≈3×10⁸. FIG. 5h gives a log plot of the Eigenvalues of this matrix. Eigenvalue distributions like this are commonly found in DRs which are prepared as sparsely connected, randomly weighted RNNs.

Example 5 A Direct/State Feedback Controller

In this example it is shown how the method of the invention can be used to obtain a state feedback neurocontroller for tracking control of a damped pendulum.

The pendulum was simulated in discrete time by the difference equation

$$\omega(t+\delta) = \omega(t) + \delta\left( -k_{1}\,\omega(t) - k_{2}\sin(\varphi(t)) + u(t) + v(t) \right), \qquad \varphi(t+\delta) = \varphi(t) + \delta\,\omega(t) \qquad (8)$$

where ω is the angular velocity, φ is the angle, δ is the timestep increment, u(t) is the control input (torque), and v(t) is uncontrolled noise input. The constants were set to k₁=0.5, k₂=1.0, δ=0.1, and the noise input was taken from a uniform distribution in [−0.02, 0.02].
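
The simulation of Eq. (8) amounts to a simple Euler-style iteration; a sketch with the stated constants:

```python
import numpy as np

def pendulum_step(omega, phi, u, rng, k1=0.5, k2=1.0, delta=0.1):
    """One step of Eq. (8); v(t) is uniform noise in [-0.02, 0.02]."""
    v = rng.uniform(-0.02, 0.02)
    omega_next = omega + delta * (-k1 * omega - k2 * np.sin(phi) + u + v)
    phi_next = phi + delta * omega
    return omega_next, phi_next

# Example: simulate a control sequence u_seq from rest.
# rng = np.random.default_rng(0)
# omega = phi = 0.0
# for u in u_seq:
#     omega, phi = pendulum_step(omega, phi, u, rng)
```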

The task was to train a tracking controller for the pendulum. More specifically, the trained controller network receives a two-steps-ahead reference trajectory y_(ref)(t+2δ)=(x_(1ref)(t+2δ),x_(2ref)(t+2δ),ω_(ref)(t+2δ)), where x_(1ref)(t+2δ),x_(2ref)(t+2δ) are the desired position coordinates of the pendulum endpoint and ω_(ref)(t+2δ) is the desired angular velocity. The length of the pendulum was 0.5, so x_(1ref)(t+2δ),x_(2ref)(t+2δ) range in [−0.5,0.5]. Furthermore, the controller receives state feedback y(t)=(x₁(t),x₂(t),ω(t)) of the current pendulum state. The controller has to generate a torque control input u(t) to the pendulum such that two update steps after the current time t the pendulum tracks the reference trajectory. FIG. 6a shows the setup of the controller in the exploitation phase.

For training, a 500-step long teacher signal was prepared by simulating the pendulum's response to a time-varying control input ũ(t), which was chosen as a superposition of two random banded signals, one with high frequencies and small amplitude, the other with low frequencies and high amplitude. FIG. 6c shows the control input ũ(t) used for the training signal, FIG. 6d shows the simulated pendulum's state answer x₂(t), and FIG. 6e the state answer ω(t) (the state answer component x₁(t) looks qualitatively like x₂(t) and is not shown). The training signal for the network consisted of inputs y(t)=(x₁(t),x₂(t),ω(t)) and y(t+2δ)=(x₁(t+2δ),x₂(t+2δ),ω(t+2δ)); from these inputs, the network had to learn to generate as its output u(t). FIG. 6b shows the training setup.

The network was set up with the same DR as in the previous (Lorenz attractor) example. 6 external input units were sparsely (connectivity 20%) and randomly (weights +0.5, −0.5 with equal probability) attached to the DR, and one output unit was provided without feedback connections back to the DR. The network update rule was the standard noisy sigmoid update rule (1′) for the internal DR units (noise homogeneously distributed in [−0.01, +0.01]). The output unit was updated with a version of Eq. (1) where the transfer function was the identity (i.e., a linear unit). The DR-to-output weights were computed by a simple linear regression such that the error ε(t)=ũ(t)−u(t) was minimized in the mean square sense over the training data set (N=500), as indicated in FIG. 6b.

In a test, the trained network was presented with a target trajectory y_(ref)(t+2δ)=(x_(1ref)(t+2δ),x_(2ref)(t+2δ),ω_(ref)(t+2δ)) at the 3 units which in the training phase received the input y(t+2δ)=(x₁(t+2δ),x₂(t+2δ),ω(t+2δ)). The network further received state feedback y(t)=(x₁(t),x₂(t),ω(t)) from the pendulum at the 3 units which received the signals y(t)=(x₁(t),x₂(t),ω(t)) during training. The network generated a control signal u(t) which was fed into the simulated pendulum. FIG. 6f shows the network output u(t); FIG. 6g shows a superposition of the reference x_(2ref)(t+2δ) (solid line) with the 2-step-delayed pendulum trajectory x₂(t+2δ) (dashed line); FIG. 6h shows a superposition of the reference ω_(ref)(t+2δ) (solid line) with the 2-step-delayed pendulum trajectory ω(t+2δ) (dashed line). The network has learnt to function as a tracking controller.

Discussion. The trained network operates as a dynamical state feedback tracking controller. Analytic design of perfect tracking controllers for the pendulum is not difficult if the system model (8) is known. The challenge in this example is to learn such a controller without a priori information from a small training data set.

The approach to obtain such a controller through training of a recurrent neural network is novel and represents a dependent claim of the invention. More specifically, the claim is a method to obtain closed-loop tracking controllers by training a recurrent neural network according to the method of the invention, where (1) the input training data consist of two vector-valued time series of the form y(t+Δ),y(t), where y(t+Δ) is a future version of the variables that will serve as a reference signal in the exploitation phase, and y(t) are state or observation feedback variables (not necessarily the same as in y(t+Δ)), and (2) the output training data consist in a vector ũ(t), which is the control input presented to the plant in order to generate the training input data y(t+Δ),y(t).

Example 6 A Two-Way Device: Frequency Generator+Frequency Meter

In this example it is shown how the method of the invention can be used to obtain a device which can be used in two ways: as a tunable frequency generator (input: frequency target, output: oscillation of desired frequency) and as a frequency meter (input: oscillation, output: frequency indication). The network has two extra units, each of which can be used either as an input or as an output unit. During training, both units are treated formally as output units, in the sense that two teacher signals are presented simultaneously: the target frequency and an oscillation of that frequency.

In the training phase, the first training channel is a slowly changing signal that varies smoothly but irregularly between 0.1 and 0.3 (FIG. 7a). The other training channel is a fast sine oscillation whose frequency varies according to the first signal (FIG. 7b; the apparent amplitude jitter is a discrete-sampling artifact).

The network was set up with a DR of 100 units. The connection weight matrix W was a band matrix with a width-5 diagonal band (i.e., w_(ji)=0 if |j−i|≧3). This band structure induces a topology on the units: the nearer two units (i.e., the smaller |j−i| mod 100), the more direct their coupling. This locality lets locally different activation patterns emerge. FIG. 7c shows the impulse responses of every 5th unit (impulse input at timestep 10). The weights within the diagonal band were preliminarily set to +1 or −1 with equal probability. The weights were then globally and uniformly scaled until the resulting DR dynamics was marginally globally stable. This scaling resulted in weights of ±0.3304 with a stability margin of δ=0.0025 (stability margins are defined in the detailed description of preferred embodiments later in this document).
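
A sketch of this band-matrix construction follows; the circular distance (|j−i| taken mod 100) is an assumption suggested by the torus-like locality described above, and the final factor is the scaling value reported in the text.

```python
import numpy as np

K = 100
rng = np.random.default_rng(7)

# Width-5 diagonal band: units i, j are coupled only if their circular
# distance is at most 2; band weights are +1 or -1 with equal probability.
W = np.zeros((K, K))
for j in range(K):
    for i in range(K):
        d = min(abs(j - i), K - abs(j - i))
        if d <= 2:
            W[j, i] = rng.choice([1.0, -1.0])

W *= 0.3304  # global uniform scaling to marginal global stability (see text)
```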

Additionally, the two extra units were equipped with feedback connections which projected back into the DR. These connections were established randomly with a connectivity of 0.5 for each of the two extra units. The weights of these feedback connections were chosen randomly to be ±1.24 for the first extra unit and ±6.20 for the second extra unit.

The network state was randomly initialized, and the network was run for 1100 steps for training. Two signals of the same kind as shown in FIGS. 7a,b were presented to the network (the target frequency signal to the first extra unit and the oscillation to the second), and the correct teacher output was written into the two output nodes (teacher forcing). The update of DR units was done with a small additive noise according to Eq. (1′). The noise was sampled from a uniform distribution over [−0.02, 0.02]. Data from the first 100 update steps were discarded. Data from the remaining 1000 update steps with teacher forcing were collected and used to obtain a linear regression solution of the least mean square error Eq. (4). The result was weights for the connections from DR units to the two output units.

In the exploitation phase, the trained RNN was used in either of two ways, as a frequency generator or as a frequency meter. In the exploitation phase, the no-noise version Eq. (1) of the update rule was used.

In the frequency generator mode of exploitation, the first extra unit was treated as an input unit, and the second as an output unit. A target frequency signal was fed into the input unit, for instance the 400-timestep staircase signal shown in FIG. 7d. At the second extra unit, here assigned to the output role, an oscillation was generated by the network. FIG. 7e shows an overlay of an oscillation of the correct frequency demanded by the staircase input (solid line) with the output actually generated by the network (dashed line). FIG. 7f shows an overlay of the frequency amplitudes (absolutes of Fourier transforms) of the correct output signal (solid line) and the network-generated output (dashed line). It appears from FIGS. 7e,f that the network has learnt to generate oscillations of the required frequencies, albeit with frequency distortions at the low and high ends of the range. FIG. 7g shows traces of 8 arbitrarily selected units of the DR. They exhibit oscillations of the same frequency as the output signal, transposed and scaled in their amplitude range according to the input signal.

In the frequency meter mode of exploitation, the second extra unit was used as an input unit into which oscillations of varying frequency are written. The first extra unit now served as the output unit. FIG. 7h shows an input signal. FIG. 7i presents an overlay of the perfect output (solid line) with the actually generated output (dashed line). The network has apparently learnt to serve as a frequency meter, although again with some distortion at the low and high ends of the range. A trace plot of DR units would look exactly like in the frequency generator mode and is omitted.

The challenge in this example is twofold. First, the network had to learn not an output dynamics per se, but rather to “discover” the dynamical relationship between the two training signals. Second, the time scales of the two signals are very different: the frequency target is essentially stationary, while the oscillation signal changes on a fast timescale. A bidirectional information exchange between signals of different timescales, which was requested from the trained network, presents a particular difficulty. Using a noisy update rule during learning was found to be indispensable in this example to obtain stable dynamics in the trained network.

This example is an instance of another dependent claim of the invention, namely, to use the method of the invention to train an RNN on the dynamic relationship between several signals. More specifically, the claim is (1) to present training data {tilde over (y)}₁(t), . . . , {tilde over (y)}_(n)(t) to n extra units of a DR architecture according to the invention, where these extra units have feedback connections to the DR, (2) to train the network such that the mean square error from Eq. (4) is minimized, and then (3) to exploit the network in any “direction” by arbitrarily declaring some of the units as input units and the remaining ones as output units.

Discussion of Examples

The examples highlight what the invariant, independent core of the invention is, and what are dependent variants that yield alternative embodiments.

Common aspects in the examples are:

-   use of a DR, characterized by the following properties:
    -   its weights are not changed during learning;
    -   its weights are globally scaled such that a marginally globally stable dynamics results;
    -   the DR is designed with the aim that the impulse responses of different units be different;
    -   the number of units is greater than would strictly be required for a minimal-size RNN for the respective task at hand (overcomplete basis aspect);
-   training only the DR-to-output connection weights such that the mean square error from Eq. (4) is minimized over the training data.

The examples exhibit differences in the following aspects:

-   The network may have a topological/spatial structure (2-dimensional grid in the excitable medium example and band-matrix-induced locality in the two-way device example) or may not have such structuring (other examples).
-   The required different impulse responses of DR units can be achieved by explicit design of the DR (excitable medium example) or by random initialization (other examples).
-   The update law of the network can be the standard method of equation (1) (short term memory and excitable medium examples) or other (leaky integration update rule in the chaotic oscillator, noisy update in the two-way device).
-   The computation of the DR-to-output connection weights can be done offline (short term memory, excitable medium, two-way device) or on-line (chaotic oscillator), using any standard method for mean square error minimization.

DETAILED DESCRIPTION OF THE INVENTION AND PREFERRED EMBODIMENTS

Preferred embodiments of the invention are now described in detail. Like in the Summary of the Invention, the detailed description is organized by presenting first the architectural and setup aspects, and then the procedural aspects of the learning method.

Setup of the DR

A central architectural aspect of the invention is the provision of the DR, whose weights are fixed and are not changed by subsequent learning. The purpose of the DR for the learning method of this invention is to provide a rich, stable, preferably long-lasting excitable dynamics. The invention provides the following methods to realize this goal.

Rich Dynamics through Large Network Size

Preferred embodiments of the invention have relatively large DRs to provide for a rich variety of different unit dynamics. 50 units and (many) more would be typical cases; less than 50 units would be suitable only for undemanding applications like learning simple oscillators.

Rich Dynamics through Inhomogeneous Network Structure

Preferred embodiments of the invention achieve a rich variety in the impulse responses of the DR units by introducing inhomogeneity into the DR. The following strategies, which can be used singly or in combination, contribute to the design goal of inhomogeneity:

-   realize an inhomogeneous connectivity structure in the DR,
    -   by constructing the DR connectivity randomly and sparsely,
    -   by using a band-structured connectivity matrix, which leads to spatial decoupling of different parts of the DR (as in the two-way device example),
    -   by imposing some other internal structuring on the DR topology, e.g. by arranging its units in layers or modules;
-   equip DR units with different response characteristics, by giving them
    -   different transfer functions,
    -   different time constants,
    -   different connection weights.

Marginally Stable Dynamics through Scaling

A preferred method to obtain a DR with a globally stable dynamics is to first construct an inhomogeneous DR according to the previously mentioned preferred embodiments, and then globally scale its weights by a common factor α which is selected such that

1.  the network dynamics is globally stable, i.e. from any starting activation the dynamics decays to zero, and
2.  this stability is only marginal, i.e. the network dynamics becomes unstable if the network weights are further scaled by a factor α′=1+δ which is greater than unity by a small margin.

When δ in the scaling factor α′=1+δ is varied, the network dynamics undergoes a bifurcation from globally stable to some other dynamics at a critical value δ_(crit). This value was called the stability margin in the examples above. The only method currently available to determine the stability margin of a given scaling factor is systematic search.
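
One way to mechanize this systematic search is sketched below; the zero-input tanh update, the trial counts, and the bisection over the scaling factor are illustrative assumptions (bisection presumes that stability is monotone in the scale, as it is in the examples of this document).

```python
import numpy as np

def is_globally_stable(W, trials=20, steps=1000, tol=1e-6, seed=0):
    """Empirical check: from random starts, does the zero-input dynamics
    x(t+1) = tanh(W x(t)) decay to the all-zero state?"""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=W.shape[0])
        for _ in range(steps):
            x = np.tanh(W @ x)
        if np.max(np.abs(x)) > tol:
            return False
    return True

def scale_to_marginal_stability(W, lo=0.0, hi=2.0, iters=20):
    """Bisection for the largest alpha such that alpha * W is stable."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_globally_stable(mid * W):
            lo = mid
        else:
            hi = mid
    return lo * W, lo
```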

Tuning Duration of Short-Term Memory through Tuning Marginality of Stability

In many applications of RNNs, a design goal is to achieve a long short-term memory in the learnt RNN. This design goal can be supported in embodiments of the invention by a proper selection of the stability margin of the DR.

The smaller the stability margin, the longer the effective short-term memory duration. Therefore, the design goal of long-lasting short-term memory capabilities can be served in embodiments of the invention by setting the stability margin to small values. In typical embodiments, where maximization of short-term memory duration is a goal, values of δ smaller than 0.1 are used.

Presenting Input to the DR

In the field of artificial neural networks, by far the most common way to present input to networks is by means of extra input units. This standard method has been used in the above examples. Alternative methods to feed input into an RNN are conceivable, but they either are essentially notational variants of extra input units (e.g., adding input terms into the DR unit activation update equation Eq. (1)) or are very rarely used (e.g., modulating global network parameters by input). Any method is compatible with the method of the invention, as long as the resulting dynamics of the DR is (1) significantly affected by the input, and (2) such that the required variability of individual DR units' dynamics is preserved.

The most common way of presenting input (by extra input units) is now described in more detail.

According to the method of the invention, the connectivity pattern from input units to the DR network, and the weights on these input-to-DR connections, are fixed at construction time and are not modified during learning.

In preferred embodiments of the invention, the input-to-DR connections and their weights are fixed in two steps. In step 1, the connectivity pattern is determined and the weights are put to initial values. In step 2, the weight values are globally scaled to maximize performance. These two steps are now described in more detail.

Step 1: Establish input-to-DR connections and put their weights to initial values. The design goal to be achieved in step 1 is to ensure a high variability in the individual DR units' responses to input signals. This goal is reached, according to the method of the invention, by observing the following rules, which can be used in any combination:

-   Provide connections sparsely, i.e., put zero weights on many or most of the possible connections from an input unit to DR units.
-   Select the weights of non-zero connections randomly by sampling from a probability distribution (as in the chaotic oscillator learning example).
-   Assign different signs to the weights of non-zero connections, i.e. provide both inhibitory and excitatory connections.

Step 2: Scale the weights set in step 1 globally. The goal of step 2 is to optimize performance. No general rule can be given. According to the specific purpose of the network, different scaling ranges can be optimal, from very small to very large absolute weights. It will be helpful to observe the following rules, which are given here for the convenience of the user. They are applicable in embodiments where the update rule of the DR network employs nonlinear (typically, sigmoid) transfer functions.

-   Large weights are preferred for fast, high-frequency I/O response characteristics, small weights for slow signals or when some lowpass characteristics are desired. For instance, in training a multistable (multiflop) memory network (not described in this document), where the entire network state had to switch from one attractor to another through a single input impulse, quite large input-to-DR weights with values of ±5.0 were used.
-   Large weights are preferred when highly nonlinear, “switching” I/O dynamics are desired; small weights are preferred for more linear I/O dynamics.
-   Large weights are preferred for tasks with low temporal memory length requirements (i.e., output at time t depends significantly only on few preceding inputs and outputs), small weights for long temporal memory effects. For instance, in the delay line example (where large memory length was aimed for), very small input-to-DR weights of ±0.001 were used.
-   If there are many input channels, channels whose input-to-DR connections have greater absolute weights are emphasized in their influence on the system output compared to low-weight channels.

Reading Output from the Network in the Exploitation Phase

Reading Output from the Network in the Exploitation Phase

According to the method of the invention, output is always read from the output units. During the exploitation phase, the j-th output y_(j)(t+1) (j=1, . . . , m) is obtained from the j-th output unit by an application of the update rule Eq. (1), i.e., by y_(j)(t+1) = f_(j)(<w_(j),x(t)>), where the inner product <w_(j),x(t)> denotes the sum of weighted activations of input units u(t), DR units x(t), and output units y(t):

$$w_{j1}u_1(t) + \ldots + w_{jn}u_n(t) + w_{j,n+1}x_1(t) + \ldots + w_{j,n+K}x_K(t) + w_{j,n+K+1}y_1(t) + \ldots + w_{j,n+K+m}y_m(t),$$

passed through the transfer function ƒ_(j) of the j-th output unit. In typical embodiments, ƒ_(j) is a sigmoid or a linear function.
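A minimal sketch of this readout computation (the tanh transfer function and the ordering of the concatenated state vector are illustrative assumptions):

```python
import numpy as np

def output_activation(w_j, u, x, y, f=np.tanh):
    """Compute y_j(t+1) = f_j(<w_j, x(t)>) for the j-th output unit.

    w_j     : weight vector of length n + K + m
    u, x, y : activations of the input, DR, and output units entering
              the readout; tanh is an illustrative choice for f_j
    """
    state = np.concatenate([u, x, y])  # u(t), x(t), y(t) concatenated
    return f(np.dot(w_j, state))
```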

Feedback Connections from Output Units to the DR

Depending on the desired task, the method of the invention provides two alternatives concerning feedback from the output units to the DR: (a) the network can be set up without such connections, or (b) the network can be equipped with such connections. Embodiments of type (a) will typically be employed for passive filtering tasks, while case (b) is typically required for active signal generation tasks. However, feedback connections can also be required in filtering tasks, especially when the filtering task involves modeling a system with an autonomous state dynamics (as in the two-way device example). This situation is analogous, in linear signal processing terminology, to infinite impulse response (IIR) filters. That terminology, however, is commonly used for linear filters, whereas RNNs yield nonlinear filters. Therefore, in this patent application another terminology shall be used: RNNs which have input and feedback connections from the output units will be referred to as serving active filtering tasks.

According to the method of the invention, when feedback connections are used (i.e., in signal generation or active filtering tasks), they are fixed at the design time of the network and not changed in the subsequent learning.

The setup of output-to-DR feedback connections is completely analogous to the setup of input-to-DR connections, which was described in detail above. Therefore, it suffices here to repeat that in a preferred embodiment of the invention, the output-to-DR feedback connections are designed in two steps. In the first step, the connectivity pattern and an initial set of weights are fixed, while in the second step the weights are globally scaled. The design goals and heuristic rules described for input-to-DR connections apply to output-to-DR connections without change, and need not be repeated.

Optimizing the Output MSE by Training the DR-to-Output Weights

After the network has been set up by providing a DR network and suitable input and output facilities, as related above, the method of the invention proceeds to determine the weights from the DR units (and possibly also from the input units, if they are provided) to the output units. This is done through a supervised training process.

Training Criterion: Minimizing the Mean Square Output Error

The weights of connections to output units are determined such that the mean square error Eq. (4) is minimized over the training data. Equation (4) is repeated here for convenience:

$$E\left[\varepsilon_j^2\right] = \frac{1}{N-1}\sum_{t=1}^{N-1}\left(f_j^{-1}\left(\tilde{y}_j(t+1)\right) - \left\langle w_j, x(t)\right\rangle\right)^2 \qquad (4)$$

In (4), ỹ_(j)(t) is the desired (teacher) output of the j-th output unit, to which the inverse of the transfer function ƒ_(j) of this unit is applied. The term <w_(j),x(t)> denotes the inner product

$$w_{j1}u_1(t) + \ldots + w_{jn}u_n(t) + w_{j,n+1}x_1(t) + \ldots + w_{j,n+K}x_K(t) + w_{j,n+K+1}y_1(t) + \ldots + w_{j,n+K+m}y_m(t), \qquad (5)\ \text{[repeated]}$$

where u_(i)(t) are the activations of input units (if applicable), x_(i)(t) those of DR units, and y_(i)(t) those of output units.

In alternative embodiments of the invention which employ online adaptive methods, instead of minimizing the MSE Eq. (4), it is also possible to minimize the following mean square error:

$$E\left[\varepsilon_j^2\right] = \frac{1}{N-1}\sum_{t=1}^{N-1}\left(\tilde{y}_j(t+1) - f_j\left(\left\langle w_j, x(t)\right\rangle\right)\right)^2 \qquad (4')$$

The theoretical difference between the two variants is that in the first case (Eq. (4)), the learning procedure minimizes the output unit state error, while in the second case the output value error is minimized. In practice this typically does not make a significant difference, because output unit state and output value are directly connected by the transfer function. In the examples described in the examples section, version (4) was used throughout.

In yet other alternative embodiments of the invention, the MSE to be minimized refers only to a subset of the input, DR, and output units. More precisely, in these alternative embodiments, the MSE

$$E\left[\varepsilon_j^2\right] = \frac{1}{N-1}\sum_{t=1}^{N-1}\left(f_j^{-1}\left(\tilde{y}_j(t+1)\right) - \left\langle s\cdot w_j, x(t)\right\rangle\right)^2 \qquad (4^{*})$$

or

$$E\left[\varepsilon_j^2\right] = \frac{1}{N-1}\sum_{t=1}^{N-1}\left(\tilde{y}_j(t+1) - f_j\left(\left\langle s\cdot w_j, x(t)\right\rangle\right)\right)^2 \qquad (4'^{*})$$

is minimized, where s is a vector of the same length as w_(j), consisting of 0's and 1's, and r·s = (r₁, . . . , r_(k))·(s₁, . . . , s_(k)) = (r₁s₁, . . . , r_(k)s_(k)) denotes elementwise multiplication. The effect of taking <s·w_(j),x(t)> instead of <w_(j),x(t)> is that only the input/DR/output units selected by the selection vector s are used for minimizing the output error. The connection weights from those input/DR/output units which are marked by 0's in s to the output units are put to zero. Specifically, variants (4*) or (4′*) can be used to preclude the learning of output-to-output connections. Variant (4*) was used in the examples “short-time memory” and “feedback controller” (precluding output-to-output feedback), and in the “excitable medium” example (extensive use of (4*) for defining the local neighborhoods shown in FIGS. 4 a,b).
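A minimal sketch of the effect of the selection vector s (the dimensions are hypothetical; the mask simply removes the contribution of deselected units):

```python
import numpy as np

# Hypothetical dimensions: n input, K DR, and m output units.
n, K, m = 2, 100, 1
s = np.ones(n + K + m)
s[n + K:] = 0.0   # deselect the output units, precluding the learning
                  # of output-to-output connections (as in variant (4*))

def masked_inner_product(w_j, x, s):
    """<s·w_j, x(t)>: only units selected by s contribute to the error."""
    return np.dot(s * w_j, x)
```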

Training Method: Supervised Teaching with Teacher Forcing

According to the method of the invention, the MSE (4), (4′), (4*) or (4′*) is minimized through a procedure of supervised teaching. A training sequence consisting of an input time series u(t) and a (desired) output time series ỹ(t) must be available, where t=1, 2, . . . , N. The input sequence u(t) may be absent when the learning task is to learn a purely generative dynamics, as in the Lorenz attractor and the excitable medium examples.

According to the method of the invention, the activations of the DR are initialized at time t=1. Preferably, the DR activations are initialized to zero or to small random values.

The method of the invention can be used for constructive offline learning and for adaptive online learning, and can be adjusted to these two cases as detailed below. However, several aspects of the invention are independent of the online/offline distinction.

According to one aspect which is independent of the online/offline distinction, the input training sequence u(t) is fed into the DR for t=1, 2, . . . , N.

According to another aspect of the invention which is independent of the online/offline distinction, the output training sequence ỹ(t)=(ỹ₁(t), . . . , ỹ_(m)(t)) is written into the m output units, i.e., the activation y_(j)(t) of the j-th output unit (j=1, . . . , m) at time t is set to ỹ_(j)(t). This is known in the RNN field as teacher forcing. Teacher forcing is essential in cases where there are feedback connections from the output units to the DR. In cases where such feedback connections are not used, teacher forcing is inconsequential, but it is assumed nonetheless for the convenience of a unified description of the method.

According to another procedural aspect of the invention which is independent of the online/offline distinction, the DR units are updated for time steps t=1, 2, . . . , N. The particular update law is irrelevant for the method of the invention. The repeated update of the DR generates an activation vector sequence x(1), . . . , x(N), where x(t) is a vector containing the activations of the network's units (including input units but excluding output units) at time t.

In preferred embodiments of the invention, a small amount of noise is added to the network dynamics during the training phase. One method of adding noise is to use update equation (1′), i.e., to add a noise term to each network state at each update time. An alternative method is to add noise to the signals u(t) and/or ỹ(t). More specifically, instead of writing u(t) into the input units, write u(t)+v(t) into them; and instead of teacher-forcing ỹ(t) into the output units, write ỹ(t)+v(t) into them (v(t) is a noise term). Note, however, that when a noisy signal ỹ(t)+v(t) is used for teacher forcing, the to-be-minimized MSE still refers to the non-noisified version of the training output, i.e., to the chosen variant of Eq. (4).
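A minimal sketch of the second noise-insertion variant (Gaussian noise and the amplitude are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisify(signal, amplitude=0.001, rng=rng):
    """Add a small noise term v(t) to an input or teacher signal;
    Gaussian noise and the amplitude 0.001 are illustrative choices."""
    return signal + amplitude * rng.standard_normal(signal.shape)

# During training one would feed noisify(u_t) into the input units and
# teacher-force noisify(y_teacher_t) into the output units, while the
# learning targets remain the non-noisified teacher outputs.
```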

Adding noise to the network dynamics is particularly helpful in signal generation and active signal processing tasks, where output-to-DR feedback connections are present. In such cases, the added noise randomly excites those internal units which have no stable, systematic dynamic relationship with the desired I/O behavior; as a consequence, weights from such “unreliable” units to the output units receive very small values from the learning procedure. The net effect is that the resulting trained network behaves more robustly (i.e., is less susceptible to perturbations). Adding noise was found to be indispensable in the “two-way device” example.

Adding noise is also beneficial in cases where the training data set is not much larger than the network size. In such cases, there is a danger of overfitting the training data; stated in an alternative way, it is then difficult to achieve good generalization performance. Insertion of noise prevents the network from fitting to idiosyncrasies in the training data, thereby improving generalization. Adding noise to counteract overfitting was a necessity in the “pendulum control” example, where only a small part of the plant's control regime was visited during training, but a reasonably generalized performance was still achieved.

Further aspects of the invention are specific to the alternative cases of offline learning and online learning. Detailed descriptions of how the method of the invention works in the two cases follow.

Description of One Update Step for Data Collection in the Training Phase (Offline Case)

When the method of the invention is used for offline learning, the training data are presented to the network for t=1, 2, . . . , N, and the resulting network states during this period are recorded. After time N, these data are used for the offline construction of MSE-minimizing weights to the output units. According to the method of the invention, the following substeps must be performed to achieve one complete update step.

Input to Update Step t→t+1:

-   1. DR units activation state x₁(t), . . . , x_(K)(t)
-   2. Output units activation state y₁(t), . . . , y_(m)(t) (identical to the teacher signal ỹ₁(t), . . . , ỹ_(m)(t))
-   3. Input signal u₁(t+1), . . . , u_(n)(t+1) [unless the task is a pure signal generation task without input]
-   4. Teacher output ỹ₁(t+1), . . . , ỹ_(m)(t+1)

Output After Update Step t→t+1:

-   1. DR units activation state x₁(t+1), . . . , x_(K)(t+1)

Side Effect of Update Step t→t+1:

-   1. Network state vector x(t+1) and teacher output ỹ₁(t+1), . . . , ỹ_(m)(t+1) are written to memory

Substeps:

-   1. [unless the task is a pure signal generation task without input] Feed the input u₁(t+1), . . . , u_(n)(t+1) to the network, using the chosen input presentation method. When input is fed into the network by means of extra input units (the standard way), this means that the activations of the n input units are set to u₁(t+1), . . . , u_(n)(t+1). The total network state is now u₁(t+1), . . . , u_(n)(t+1), x₁(t), . . . , x_(K)(t), y₁(t), . . . , y_(m)(t) [in the case where input units are used; otherwise omit u₁(t+1), . . . , u_(n)(t+1)].
-   2. Update the state of the DR units by applying the chosen update rule. For instance, when Eq. (1) is used, for every i=1, . . . , K evaluate

$$x_i(t+1) = f_i\bigl(w_{i1}u_1(t+1) + \ldots + w_{in}u_n(t+1) + w_{i,n+1}x_1(t) + \ldots + w_{i,n+K}x_K(t) + w_{i,n+K+1}y_1(t) + \ldots + w_{i,n+K+m}y_m(t)\bigr)$$

-   3. Write x(t+1)=u₁(t+1), . . . , u_(n)(t+1), x₁(t+1), . . . , x_(K)(t+1), y₁(t), . . . , y_(m)(t) and ỹ₁(t+1), . . . , ỹ_(m)(t+1) into a memory for later use in the offline computation of optimal weights. [In cases where the MSE to be minimized is of form (4*), write into memory x(t+1)=s·(u₁(t+1), . . . , u_(n)(t+1), x₁(t+1), . . . , x_(K)(t+1), y₁(t), . . . , y_(m)(t)).]
-   4. Write the teacher signal ỹ₁(t+1), . . . , ỹ_(m)(t+1) into the output units (teacher forcing), i.e., put y₁(t+1), . . . , y_(m)(t+1) = ỹ₁(t+1), . . . , ỹ_(m)(t+1). (A code sketch of this complete data-collection step is given below.)
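The following is a minimal sketch of one such data-collection update step, assuming the standard input presentation by extra input units and tanh transfer functions (both illustrative choices); the name W stands for the full (K × (n+K+m)) weight matrix into the DR units and is an assumption of this sketch:

```python
import numpy as np

def collect_step(x, y, u_next, y_teacher_next, W, states, teachers):
    """One offline training update step t -> t+1 with teacher forcing.

    x : DR activations x(t); y : output activations y(t) (teacher-forced);
    u_next : input u(t+1); y_teacher_next : teacher output ~y(t+1);
    W : (K, n+K+m) weight matrix into the DR units (fixed, not learned);
    states, teachers : lists collecting data for the later regression.
    """
    # Substeps 1-2: feed the input and update the DR units.
    total_state = np.concatenate([u_next, x, y])
    x_next = np.tanh(W @ total_state)
    # Substep 3: write the new network state and teacher output to memory.
    states.append(np.concatenate([u_next, x_next, y]))
    teachers.append(y_teacher_next.copy())
    # Substep 4: teacher forcing -- write ~y(t+1) into the output units.
    y_next = y_teacher_next.copy()
    return x_next, y_next
```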

Description of the Optimal Weight Computation in the Offline Case

At time N, N state/teacher-output pairs x(t), ỹ₁(t), . . . , ỹ_(m)(t) have been collected in memory. The method of the invention now proceeds to compute the weights w_(j,i) from all units which have entry 1 in the selection vector s to the j output units. These weights are computed such that the chosen variant of the MSE (e.g., (4) or (4*)) is minimized. Technically, this is a linear regression task, for which many efficient methods are available. (Technical data analysis software packages, like Matlab, Mathematica, LinPack, or statistical data analysis packages, all contain highly refined linear regression procedures. For the production of the examples described in this document, the Fit procedure of Mathematica was used.) Because the particular way in which this linear regression is performed is not part of the invention, and because it will not present any difficulties to practitioners in the field, only the case where the MSE (4) is minimized is briefly treated here.

As a preparation, it is advisable to discard some initial state/teacher-output pairs, to accommodate the fact that initial transients in the network should die out before data are used for training. After this, for each output unit j, consider the argument-value data set (x(t), f_(j)⁻¹(ỹ_(j)(t))), t=t₀, . . . , N. Compute linear regression weights for the least mean square error regression of the values f_(j)⁻¹(ỹ_(j)(t)) on the arguments x(t), i.e., compute weights w_(j,i) such that the MSE Eq. (4) is minimized.
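A minimal sketch of this weight computation with NumPy's standard least-squares routine, continuing the data-collection sketch above (arctanh as the inverse of the illustrative tanh transfer function; the discard count is an assumption):

```python
import numpy as np

def compute_output_weights(states, teachers, discard=100):
    """Offline computation of MSE-minimizing weights to the output units.

    states   : list of network state vectors x(t) collected during training
    teachers : list of teacher output vectors ~y(t), assumed within (-1, 1)
    discard  : number of initial pairs dropped so transients die out
               (100 is an illustrative value)
    Returns W_out of shape (m, n+K+m), one weight row per output unit.
    """
    X = np.asarray(states[discard:])                # arguments x(t)
    Y = np.arctanh(np.asarray(teachers[discard:]))  # values f^-1(~y(t))
    # Linear regression minimizing ||X @ W_out.T - Y||^2, i.e. Eq. (4)
    W_out, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W_out.T
```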

Write these weights into the network, which is now ready for exploitation.

Description of One Update Step in the Exploitation Phase

When the trained network is exploited, input u₁(t), . . . , u_(n)(t) is fed into it online [unless it is a pure signal generation device], and the network produces output y₁(t), . . . , y_(m)(t) in an online manner. For convenience, a detailed description of an update step of the network during exploitation is given here.

Input to Update Step t→t+1:

-   1. DR units activation state x₁(t), . . . , x_(K)(t)
-   2. Output units activation state y₁(t), . . . , y_(m)(t)
-   3. Input signal u₁(t+1), . . . , u_(n)(t+1) [unless the task is a pure signal generation task without input]

Output After Update Step t→t+1:

-   1. DR units activation state x₁(t+1), . . . , x_(K)(t+1)
-   2. Output units activation state y₁(t+1), . . . , y_(m)(t+1)

Substeps:

-   1. [unless the task is a pure signal generation task without input] Feed the input u₁(t+1), . . . , u_(n)(t+1) to the network.
-   2. Update the state of the DR units by applying the chosen update rule. For instance, when Eq. (1) is used, for every i=1, . . . , K evaluate

$$x_i(t+1) = f_i\bigl(w_{i1}u_1(t+1) + \ldots + w_{in}u_n(t+1) + w_{i,n+1}x_1(t) + \ldots + w_{i,n+K}x_K(t) + w_{i,n+K+1}y_1(t) + \ldots + w_{i,n+K+m}y_m(t)\bigr)$$

-   3. Update the states of the output units by applying the chosen update rule. For instance, when Eq. (1) is used, for every j=1, . . . , m evaluate

$$y_j(t+1) = f_j\bigl(w_{j1}u_1(t+1) + \ldots + w_{jn}u_n(t+1) + w_{j,n+1}x_1(t+1) + \ldots + w_{j,n+K}x_K(t+1) + w_{j,n+K+1}y_1(t) + \ldots + w_{j,n+K+m}y_m(t)\bigr)$$

The important point to note here is the “cascaded” update: first the DR units are updated in substep 2, then the output units are updated in substep 3. This corresponds to a similarly “cascaded” update in the training phase.
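A minimal sketch of one such cascaded exploitation step, under the same illustrative assumptions as above (tanh transfer functions; W the fixed weights into the DR; W_out the trained weights to the output units):

```python
import numpy as np

def exploit_step(x, y, u_next, W, W_out):
    """One exploitation update step t -> t+1 with cascaded updates."""
    # Substeps 1-2: feed the input and update the DR units from the old state.
    x_next = np.tanh(W @ np.concatenate([u_next, x, y]))
    # Substep 3: update the output units from the NEW DR state x(t+1)
    # and the OLD output state y(t) -- the "cascaded" update.
    y_next = np.tanh(W_out @ np.concatenate([u_next, x_next, y]))
    return x_next, y_next
```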

Variations

In updating recurrent neural networks with extra input and output units, there is some degree of freedom in the particular relative update order of the various types of units (input, DR, output). For instance, instead of the particular “cascaded” update described above, in alternative embodiments the DR units and the output units can be updated simultaneously, resulting in slightly (but typically not significantly) different network behavior. In yet other alternative embodiments, where the DR is endowed with a modular or layered substructure, more complex update regulations may be required, updating particular regions of the network in a particular order. The important point for the method of the invention is that whichever update scheme is used, the same scheme must be used in the training and in the exploitation phases.

Description of One LMS Update Step for Online Adaptation

In contrast to the offline variants of the method, online adaptation methods can be used both for minimizing the output state error (MSE criteria (4), (4*)) and for minimizing the output value error (MSE criteria (4′), (4′*)).

In online adaptation, the weights w_(j,i) to the j-th output unit are incrementally optimized at every time step, thereby becoming time-dependent variables w_(j,i)(t) themselves. A host of well-known methods for online MSE-minimizing adaptation can be used with the method of the invention, for instance stochastic gradient descent methods like the LMS method or Newton's method (or combinations thereof), or so-called “deterministic” methods like the RLS method.

Among these, the LMS method is by far the simplest. It is not optimally suited for the method of the invention (the reasons for this have been indicated in the discussion of the Lorenz attractor example). Nonetheless, owing to its simplicity, LMS is the best choice for a didactic illustration of the principles of the online version of the method of the invention.

Here is a description of one update step, using the LMS method to optimize the weights.

Input to Update Step t→t+1:

-   1. DR units activation state x₁(t), . . . , x_(K)(t)
-   2. Output units activation state y₁(t), . . . , y_(m)(t)
-   3. Input signal u₁(t+1), . . . , u_(n)(t+1) [unless the task is a pure signal generation task without input]
-   4. Teacher output ỹ₁(t+1), . . . , ỹ_(m)(t+1)
-   5. Weights w_(j,i)(t) of the connections to the output units

Output After Update Step t→t+1:

-   1. DR units activation state x₁(t+1), . . . , x_(K)(t+1)
-   2. Output units activation state y₁(t+1), . . . , y_(m)(t+1)
-   3. New weights w_(j,i)(t+1)

Substeps:

-   1. [unless the task is a pure signal generation task without input] Feed the input u₁(t+1), . . . , u_(n)(t+1) to the network.
-   2. Update the DR units by applying the chosen update rule. For instance, when Eq. (1) is used, for every i=1, . . . , K evaluate

$$x_i(t+1) = f_i\bigl(w_{i1}u_1(t+1) + \ldots + w_{in}u_n(t+1) + w_{i,n+1}x_1(t) + \ldots + w_{i,n+K}x_K(t) + w_{i,n+K+1}y_1(t) + \ldots + w_{i,n+K+m}y_m(t)\bigr)$$

-   3. Update the states of the output units by applying the chosen update rule. For instance, when Eq. (1) is used, for every j=1, . . . , m evaluate

$$y_j(t+1) = f_j\bigl(w_{j1}u_1(t+1) + \ldots + w_{jn}u_n(t+1) + w_{j,n+1}x_1(t+1) + \ldots + w_{j,n+K}x_K(t+1) + w_{j,n+K+1}y_1(t) + \ldots + w_{j,n+K+m}y_m(t)\bigr)$$

-   4. For every output unit j=1, . . . , m, update the weights w_(j)(t)=(w_(j,1)(t), . . . , w_(j,n+K+m)(t)) to w_(j)(t+1), according to the adaptation method chosen. Here the LMS method is described as an example. It comprises the following substeps:
    -   a. Compute the error ε_(j)(t+1) = ỹ_(j)(t+1) − y_(j)(t+1). [Note: this yields an output value error, and consequently the MSE of Eq. (4′) will be minimized. In order to minimize the output state error, use ε_(j)(t+1) = f_(j)⁻¹(ỹ_(j)(t+1)) − f_(j)⁻¹(y_(j)(t+1)) instead.]
    -   b. Put w_(j)(t+1) = w_(j)(t) + με_(j)(t+1)x(t), where μ is a learning rate and x(t) is the total network state (including input and output units) obtained after substep 3.
-   5. If there are output-to-DR feedback connections, write the teacher signal ỹ₁(t+1), . . . , ỹ_(m)(t+1) into the output units (teacher forcing), i.e., put y₁(t+1), . . . , y_(m)(t+1) = ỹ₁(t+1), . . . , ỹ_(m)(t+1). (A sketch of this LMS update follows below.)
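A minimal sketch of substeps 4 and 5 (the LMS weight update) under the same illustrative assumptions as the sketches above; the learning rate value is an assumption:

```python
import numpy as np

def lms_update(W_out, state, y_next, y_teacher_next, mu=0.01):
    """LMS update of the output weights (substep 4) plus teacher forcing.

    W_out          : (m, n+K+m) current weights to the output units
    state          : total network state that produced y(t+1)
    y_next         : network output y(t+1)
    y_teacher_next : teacher output ~y(t+1)
    mu             : learning rate (0.01 is an illustrative value)
    """
    # a. Output value error (so the MSE variant (4') is minimized)
    error = y_teacher_next - y_next
    # b. Gradient-descent step on the output weights
    W_out = W_out + mu * np.outer(error, state)
    # Substep 5: teacher forcing, if output-to-DR feedback exists
    y_forced = y_teacher_next.copy()
    return W_out, y_forced
```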

As in the offline version of the method of the invention, many trivial variations of this update scheme exist, distinguished from each other, e.g., by the update equation (which version of Eq. (4) is used), by the particular order in which parts of the network are updated in a cascaded fashion, by the specific method in which input is administered, etc. These variations are not consequential for the method of the invention; the above detailed scheme of an update step is only an illustration of one possibility.

1. A method for constructing a discrete-time recurrent neural network and training it in order to minimize its output error, comprising: constructing a recurrent neural network as a reservoir of excitable dynamics (dynamical reservoir network); providing means of feeding input to the dynamical reservoir network; attaching output units to the dynamical reservoir network through weighted connections; and training the weights of the connections only from the dynamical reservoir network to the output units in a supervised training scheme.
2. The method of claim 1, wherein the dynamical reservoir network has a number of units greater than 50.
3. The method of claim 1 or 2, wherein the dynamical reservoir network is sparsely connected.
4. The method of claim 1, wherein the connections within the dynamical reservoir network have randomly assigned weights.
5. The method of claim 1, wherein different update rules or differently parameterized update rules are used for different dynamical reservoir units.
6. The method of claim 1, wherein a spatial structure is imprinted on the dynamical reservoir network through the connectivity pattern.
7. The method of claim 6, wherein the spatial structure is a regular grid.
8. The method of claim 6, wherein the spatial structure is a local neighborhood structure induced by a banded or subbanded structure of the connectivity matrix.
9. The method of claim 6, wherein the spatial structure is modular or organized in levels.
10. The method of claim 1, wherein the weights within the dynamical reservoir are globally scaled such that the resulting dynamics of the isolated dynamical reservoir network is globally stable.
11. The method of claim 1, wherein the weights within the dynamical reservoir are globally scaled such that the resulting dynamics of the isolated dynamical reservoir network is marginally globally stable, in order to achieve a long duration of memory effects in the final network after training.
12. The method of claim 10 or 11, wherein input is fed to the dynamical reservoir by means of extra input units.
13. The method of claim 12, wherein the connections from the input units to the dynamical reservoir are sparse.
14. The method of claim 12, wherein the weights of the connections from the input units to the dynamical reservoir are randomly fixed and have negative and positive signs.
15. The method of claim 12, wherein the weights of the connections from the input units to the dynamical reservoir are globally scaled to small absolute values, in order to achieve a long duration of memory effects in the final network I/O characteristics, or in order to achieve slow or low-pass time characteristics in the final network I/O characteristics, or in order to achieve nearly linear I/O characteristics.
16. The method of claim 12, wherein the weights of the connections from the input units to the dynamical reservoir are globally scaled to large absolute values, in order to achieve a short duration of memory effects, or in order to achieve fast I/O behavior, or in order to achieve highly nonlinear or “switching” characteristics in the final trained network.
17. The method of claim 10 or 11, wherein input is fed to the dynamical reservoir by means other than extra input units.
18. The method of claim 1, wherein extra output units are attached to the dynamical reservoir without feedback connections from the output units to the dynamical reservoir, in order to obtain a passive signal processing network after training.
19. The method of claim 1, wherein extra output units are attached to the dynamical reservoir with feedback connections from the output units to the dynamical reservoir, in order to obtain an active signal processing or signal generation network after training.
20. The method of claim 19, wherein the feedback connections are sparse.
21. The method of claim 19 or 20, wherein the weights of the feedback connections are randomly fixed and have negative and positive signs.
22. The method of claim 19, wherein the weights of the feedback connections are globally scaled to small absolute values, in order to achieve a long duration of memory effects in the final network I/O characteristics, or in order to achieve slow or low-pass time characteristics in the final network I/O characteristics, or in order to achieve linear I/O characteristics.
23. The method of claim 19, wherein the weights of the feedback connections are globally scaled to large absolute values, in order to achieve a short duration of memory effects, or in order to achieve fast I/O behavior, or in order to achieve highly nonlinear or “switching” characteristics in the final trained network.
24. The method of claim 1, wherein the network is trained in an offline version of supervised teaching.
25. The method of claim 24, wherein the task to be learnt is a signal generation task, no input exists, and the teacher signal consists only of a sample of the desired output signal.
26. The method of claim 24, wherein the task to be learnt is a signal processing task, where input exists, and where the teacher signal consists of a sample of the desired input/output pairing.
27. The method of any one of claims 24 to 26, wherein output-error-minimizing weights of the connections to the output nodes are computed, comprising: presenting the teacher signals to the network and running the network in teacher-forced mode for the duration of the teaching period; saving into a memory the network states and the signals f_(j)⁻¹(ỹ_(j)(t)) obtained by mapping the inverse of the output unit's transfer function on the teacher output; optionally discarding initial state/output pairs in order to accommodate initial transient effects; and computing the weights of the connections to the output nodes by a standard linear regression method.
28. The method of claim 24, wherein during the training period noise is inserted into the network dynamics, by utilizing a noisy update rule, and/or by adding noise on the input, and/or by adding a noise component to the teacher output before it is fed back into the dynamical reservoir if output-to-dynamical-reservoir feedback connections exist.
29. The method of claim 24, wherein weights of connections from only a subset of the network's units (i.e., a subset of the input, dynamical reservoir, and output units) to the output units are trained, and the other ones are set to zero.
30. The method of claim 1, wherein the network is trained in an online version of supervised teaching.
31. The method of claim 30, wherein the task to be learnt is a signal generation task, no input exists, and the teacher signal consists only of a sample of the desired output signal.
32. The method of claim 30, wherein the task to be learnt is a signal processing task, where input exists, and where the teacher signal consists of a sample of the desired input/output pairing.
33. The method of any one of claims 30 to 32, wherein output-error-minimizing weights of the connections to the output nodes are updated at every time step, the update comprising: feeding the input to the network and updating the network; for every output unit, computing an error, either as the difference between the desired teacher output and the actual network output, or, alternatively, as the difference between the value f_(j)⁻¹(ỹ_(j)(t)) obtained by mapping the inverse of the output unit's transfer function on the teacher output and the value obtained by mapping the inverse of the output unit's transfer function on the actual output (output state error); updating the weights of the connections to the output nodes by a standard method for minimizing the error computed in the previous substep; and, in cases of signal generation tasks or active signal processing tasks, forcing the teacher output into the output units.
34. The method of claim 30, wherein noise is inserted into the network dynamics, by utilizing a noisy update rule or by adding a noise component to the teacher output before it is fed back into the dynamical reservoir if feedback connections exist.
35. The method of claim 30, wherein weights of connections from only a subset of the network's units (i.e., a subset of the input, dynamical reservoir, and output units) to the output units are trained, and the other ones are set to zero.
36. The method of claim 1, wherein the network is trained on two or more output units with feedback connections to the dynamical reservoir, which in the exploitation phase are utilized in any chosen “direction”, by treating any chosen subset of the trained units as input units and the remaining ones as output units, to realize the learning of dynamical relationships between signals.
37. The method of claim 36, applied to tasks of reconstructive memory of multidimensional dynamical patterns, comprising: training the network with teaching signals consisting of complete-dimensional samples of the patterns; and, in the exploitation phase, presenting cue patterns which are incompletely given in only some of the dimensions as input in those dimensions, and reading out the completed dynamical patterns on the remaining units.
38. The method of claim 1, applied to tasks of closed-loop state or observation feedback tracking control of a plant, comprising: using training samples consisting of two kinds of input signals to the network, namely (i) a future version of the variables that will serve as a reference signal in the exploitation phase, and (ii) plant output or plant state observation, and consisting further of a desired network output signal, namely (iii) plant control input; training a network using these teacher input and output signals, in order to obtain a network which computes as network output a plant control input depending on the current plant output observation and a future version of the reference variables; and exploiting the network as a closed-loop controller by feeding it with the inputs (i) future reference signals and (ii) current plant output or plant state observation, and letting the network generate the current plant control input.
39. A neural network constructed as a discrete-time recurrent neural network and trained in order to minimize its output error, comprising: a recurrent neural network as a reservoir of excitable dynamics (dynamical reservoir network); means for feeding input to the dynamical reservoir network; and output units attached to the dynamical reservoir network through weighted connections; wherein only the weights of the connections from the dynamical reservoir network to the output units are trained in a supervised training scheme.
40. A neural network according to claim 39, wherein it is implemented as a microcircuit.
41. A neural network according to claim 39, wherein it is implemented by a suitably programmed computer.